{
 "cells": [
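  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data description: the Pima Indians Diabetes Data Set. The task is to predict, from the available medical measurements, the probability that a Pima Indian patient develops diabetes within five years.\n",
    "\n",
    "The dataset has 9 columns:\n",
    "\n",
    "- Column 0: `pregnants` (number of pregnancies)\n",
    "- Column 1: `Plasma_glucose_concentration` (plasma glucose concentration 2 hours into an oral glucose tolerance test)\n",
    "- Column 2: `blood_pressure` (diastolic blood pressure, mm Hg)\n",
    "- Column 3: `Triceps_skin_fold_thickness` (triceps skin fold thickness, mm)\n",
    "- Column 4: `serum_insulin` (2-hour serum insulin, mu U/ml)\n",
    "- Column 5: `BMI` (body mass index: weight in kg / (height in m)^2)\n",
    "- Column 6: `Diabetes_pedigree_function` (diabetes pedigree function)\n",
    "- Column 7: `Age` (age in years)\n",
    "- Column 8: `Target` (class label, 0 or 1)\n",
    "\n",
    "Below, a logistic regression model is trained on this data."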
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the required modules\n",
    "import pandas as pd \n",
    "import numpy as np\n",
    "\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnants</th>\n",
       "      <th>Plasma_glucose_concentration</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>serum_insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>Diabetes_pedigree_function</th>\n",
       "      <th>Age</th>\n",
       "      <th>Target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.639947</td>\n",
       "      <td>0.866045</td>\n",
       "      <td>-0.031990</td>\n",
       "      <td>0.670643</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>0.166619</td>\n",
       "      <td>0.468492</td>\n",
       "      <td>1.425995</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.844885</td>\n",
       "      <td>-1.205066</td>\n",
       "      <td>-0.528319</td>\n",
       "      <td>-0.012301</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>-0.852200</td>\n",
       "      <td>-0.365061</td>\n",
       "      <td>-0.190672</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.233880</td>\n",
       "      <td>2.016662</td>\n",
       "      <td>-0.693761</td>\n",
       "      <td>-0.012301</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>-1.332500</td>\n",
       "      <td>0.604397</td>\n",
       "      <td>-0.105584</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.844885</td>\n",
       "      <td>-1.073567</td>\n",
       "      <td>-0.528319</td>\n",
       "      <td>-0.695245</td>\n",
       "      <td>-0.540642</td>\n",
       "      <td>-0.633881</td>\n",
       "      <td>-0.920763</td>\n",
       "      <td>-1.041549</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1.141852</td>\n",
       "      <td>0.504422</td>\n",
       "      <td>-2.679076</td>\n",
       "      <td>0.670643</td>\n",
       "      <td>0.316566</td>\n",
       "      <td>1.549303</td>\n",
       "      <td>5.484909</td>\n",
       "      <td>-0.020496</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pregnants  Plasma_glucose_concentration  blood_pressure  \\\n",
       "0   0.639947                      0.866045       -0.031990   \n",
       "1  -0.844885                     -1.205066       -0.528319   \n",
       "2   1.233880                      2.016662       -0.693761   \n",
       "3  -0.844885                     -1.073567       -0.528319   \n",
       "4  -1.141852                      0.504422       -2.679076   \n",
       "\n",
       "   Triceps_skin_fold_thickness  serum_insulin       BMI  \\\n",
       "0                     0.670643      -0.181541  0.166619   \n",
       "1                    -0.012301      -0.181541 -0.852200   \n",
       "2                    -0.012301      -0.181541 -1.332500   \n",
       "3                    -0.695245      -0.540642 -0.633881   \n",
       "4                     0.670643       0.316566  1.549303   \n",
       "\n",
       "   Diabetes_pedigree_function       Age  Target  \n",
       "0                    0.468492  1.425995       1  \n",
       "1                   -0.365061 -0.190672       0  \n",
       "2                    0.604397 -0.105584       1  \n",
       "3                   -0.920763 -1.041549       0  \n",
       "4                    5.484909 -0.020496       1  "
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the data\n",
    "# The CSV file is expected in the current working directory\n",
    "train = pd.read_csv(\"FE_pima-indians-diabetes.csv\")\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic information about the data\n",
    "\n",
    "Number of samples, number of features, and each column's dtype and non-null count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 768 entries, 0 to 767\n",
      "Data columns (total 9 columns):\n",
      "pregnants                       768 non-null float64\n",
      "Plasma_glucose_concentration    768 non-null float64\n",
      "blood_pressure                  768 non-null float64\n",
      "Triceps_skin_fold_thickness     768 non-null float64\n",
      "serum_insulin                   768 non-null float64\n",
      "BMI                             768 non-null float64\n",
      "Diabetes_pedigree_function      768 non-null float64\n",
      "Age                             768 non-null float64\n",
      "Target                          768 non-null int64\n",
      "dtypes: float64(8), int64(1)\n",
      "memory usage: 54.1 KB\n"
     ]
    }
   ],
   "source": [
    "train.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_train = train['Target']\n",
    "X_train = train.drop(['Target'], axis=1)\n",
    "\n",
    "# Keep the feature names for later use (e.g. visualization)\n",
    "feat_names = X_train.columns\n",
    "\n",
    "# Most sklearn estimators accept sparse input, which can make training much faster on sparse data\n",
    "# To check whether an estimator supports it, see whether its fit docstring lists X: {array-like, sparse matrix}\n",
    "# You can use timeit to compare training time on dense and sparse input\n",
    "#from scipy.sparse import csr_matrix\n",
    "#X_train = csr_matrix(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Logistic Regression with default parameters, 5-fold cross-validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "lr = LogisticRegression()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "logloss of each fold is:  [0.48797856 0.53011593 0.4562292  0.422546   0.48392885]\n",
      "cv logloss is: 0.47615970944434044\n"
     ]
    }
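   ],
   "source": [
    "# Cross-validation is used to estimate model performance and to tune hyperparameters (model selection)\n",
    "# For classifiers, an integer cv uses StratifiedKFold by default\n",
    "# Here we use 5-fold cross-validation\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "from sklearn.model_selection import cross_val_score\n",
    "# The scorer is named 'neg_log_loss'; the bare 'log_loss' name is not accepted by current scikit-learn\n",
    "loss = cross_val_score(lr, X_train, y_train, cv=5, scoring='neg_log_loss')\n",
    "#%timeit loss_sparse = cross_val_score(lr, X_train_sparse, y_train, cv=5, scoring='neg_log_loss')\n",
    "print('logloss of each fold is: ', -loss)\n",
    "print('cv logloss is:', np.mean(-loss))"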
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Logistic Regression + GridSearchCV: tuning the regularization hyperparameters"
   ]
  },
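  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The hyperparameters to tune for logistic regression are C (the regularization coefficient; candidate values are usually spaced uniformly on a log scale) and the penalty function (L2 or L1).\n",
    "\n",
    "The objective function is `J = C * sum_i logloss(f(x_i), y_i) + penalty(w)`, so a larger C means weaker regularization.\n",
    "\n",
    "In sklearn, the tuning procedure is the same for every estimator:\n",
    "1. Define the parameter search range\n",
    "2. Create the estimator instance (with its fixed parameters)\n",
    "3. Create the GridSearchCV instance (with its settings)\n",
    "4. Call the `fit` method of the GridSearchCV instance"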
   ]
  },
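  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the objective above concrete, the following cell is a minimal sketch that evaluates `J` for an L2-penalized model at log-spaced values of C. It runs on synthetic data from `make_classification` rather than on the Pima data, and the names `X_demo`, `y_demo`, `C_candidates` and `l2_objective` are hypothetical helpers introduced only for this illustration. Because a larger C puts more weight on the data term, the norm of the fitted weights is expected to grow with C."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import log_loss\n",
    "\n",
    "# Synthetic stand-in data (assumption: NOT the Pima dataset)\n",
    "X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)\n",
    "\n",
    "# Candidate C values spaced uniformly on a log scale: 1e-3, 1e-2, ..., 1e3\n",
    "C_candidates = np.logspace(-3, 3, 7)\n",
    "\n",
    "def l2_objective(model, X, y, C):\n",
    "    # J = C * sum_i logloss_i + 0.5 * ||w||^2 (the intercept is not penalized)\n",
    "    data_term = C * log_loss(y, model.predict_proba(X), normalize=False)\n",
    "    penalty_term = 0.5 * np.sum(model.coef_ ** 2)\n",
    "    return data_term + penalty_term\n",
    "\n",
    "weight_norms = []\n",
    "for C in C_candidates:\n",
    "    # The default penalty is L2; lbfgs minimizes the objective written above\n",
    "    model = LogisticRegression(C=C, solver='lbfgs', max_iter=1000).fit(X_demo, y_demo)\n",
    "    weight_norms.append(np.linalg.norm(model.coef_))\n",
    "    print('C=%g  J=%.3f  ||w||=%.3f' % (C, l2_objective(model, X_demo, y_demo, C), weight_norms[-1]))"
   ]
  },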
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Tuning the regularization hyperparameters with 5-fold cross-validation and log loss"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, error_score='raise',\n",
       "       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
       "          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
       "          verbose=0, warm_start=False),\n",
       "       fit_params=None, iid=True, n_jobs=4,\n",
       "       param_grid={'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10, 100, 1000]},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
       "       scoring='log_loss', verbose=0)"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# Parameters to tune\n",
    "# Suggestion: try tuning L1 and L2 separately, each paired with a suitable solver\n",
    "#tuned_parameters = {'penalty':['l1','l2'],\n",
    "#                   'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]\n",
    "#                   }\n",
    "penaltys = ['l1','l2']\n",
    "Cs = [ 0.1, 1, 10, 100, 1000]\n",
    "tuned_parameters = dict(penalty = penaltys, C = Cs)\n",
    "\n",
    "# liblinear supports both the L1 and the L2 penalty\n",
    "lr_penalty = LogisticRegression(solver='liblinear')\n",
    "# scoring must be 'neg_log_loss'; return_train_score=True keeps the training-fold scores used in the plot further down\n",
    "grid = GridSearchCV(lr_penalty, tuned_parameters, cv=5, scoring='neg_log_loss',\n",
    "                    n_jobs=4, return_train_score=True)\n",
    "grid.fit(X_train,y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.4760259454061601\n",
      "{'C': 1, 'penalty': 'l1'}\n"
     ]
    }
   ],
   "source": [
    "# examine the best model\n",
    "print(-grid.best_score_)\n",
    "print(grid.best_params_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZIAAAEKCAYAAAA4t9PUAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3Xl8FvW59/HPlRCIbKIsCgQaQFBAkSUVkLi0VkWkgKKARTy2+nBsweU8rT3Wtp5We1zr8iCoUKqV4q5FKaCcakXAAhIUWUWBYzUCghEBlT3X88dM0tuQ5Q6TyR2S7/v1mldm+c3c10yWb2Y3d0dERORwpaW6ABERObIpSEREJBIFiYiIRKIgERGRSBQkIiISiYJEREQiUZCIiEgkChIREYlEQSIiIpHUS3UB1aFFixaenZ2d6jJERI4oy5Yt+8zdW1bUrk4ESXZ2Nnl5eakuQ0TkiGJm/0ymnQ5tiYhIJAoSERGJREEiIiKR1IlzJCJSd+3fv5/8/Hz27NmT6lJqrMzMTLKyssjIyDis+RUkIlKr5efn06RJE7KzszGzVJdT47g7BQUF5Ofn06FDh8Nahg5tiUittmfPHpo3b64QKYOZ0bx580h7bAoSEan1FCLli7p9FCQiIiWMnLyIkZMXpbqMI4aCpBz6YRKRqtC4cePi/oEDB9KsWTMGDx5cattx48bRs2dPunXrxlFHHUXPnj3p2bMnzz//fNKfN2PGDO65557IdSdLJ9tFRKrRjTfeyNdff83kyZNLnT5p0iQAPvzwQwYPHszy5ctLbXfgwAHq1Sv9T/hFF11UNcUmSXskIiLV6JxzzqFJkyaHNW9ubi6//OUvOfPMM5k4cSIvvfQSffv2pVevXpx33nls3boVgKlTp3LDDTcAcPnll3P99ddz+umn07FjR2bMmFFl61JEeyQiUmf89q+rWbNpZ4Xt1mwO2iRzaLtbm6b81/e7R64tWTt37mT+/PkAbN++nSFDhmBmPPLII9x7773cddddh8yzdetW3nzzTVauXMmIESOqfI9FQSIicgQZNWpUcf9HH33EiBEj2LJlC3v37qVLly6lzjNs2DDMjB49evDJJ59UeU0KEhGpM5LdcyjaE3nm3/vHWc5hadSoUXH/uHHjuPnmmxk0aBCvvvoqd955Z6nzNGjQoLjf3au8Jp0jERE5Qu3YsYO2bdvi7jz++OMpq0NBIiJSjc444wwuvfRSXnvtNbKyspg7d+5hL+s3v/kNF110EWeddRbHHXdcFVZZOTq0JSISsy+//LK4f8GCBUnNk52dzapVq74xbuHChd8YHj58OMOHDz9k3quvvrq4f/r06WXWUlUUJCIiJdTEcyM1mQ5tiYhIJAoSERGJREEiIiKRKEhERCQSBYmISEmPXRh0khQFiYhIzIoeI798+XL69+9P9+7d6dGjB88888whbfUYeRERKVPDhg2ZNm0anTt3ZtOmTfTp04fzzz+fZs2aFbc5Eh8jryAREakmiQ9VbNOmDa1atWLbtm3fCJLy5ObmctZZZ7FgwQIuvvhiOnTowO23386+ffto2bIl06dPp1WrVkydOpVVq1bxwAMPcPnll9O8eXOWLl3Kli1buPfee/X0XxGRw/byTbBlZcXttqwIviZznuT4U+CC0h+WWJ633nqLffv20alTp0rNp8fIi4gImzdvZsyYMTz++OOkpVXuVLUeIy8ikkrJ7jkU7Yn8cHaVl7Bz504uvPBCfve739GvX79Kz6/HyIuI1GH79u3joosu4oorruDSSy+NvDw9Rl5EpI559tlnmT9/Pn/605+KL+st66qsZNSUx8hbHLs5xQs3Gwj8PyAdmOrupe53mdklwHPAt909z8xGAzcmNOkB9Hb35WY2D2gN7A6nnefuW8urIycnx/Py8ipdf01+S5qIJGft2rV07dq1cjPFeGirpiptO5nZMnfPqWje2M6RmFk6
MAk4F8gHlprZTHdfU6JdE+A6YEnROHd/AnginH4K8JK7J8b2aHevfDKIiCSjDgVIVYjz0NZpwHp33+ju+4CngaGltLsNuBvYU8ZyLgOeiqdEERGJKs4gaQt8nDCcH44rZma9gHbuPquc5Yzk0CB5zMyWm9mvzcyqpFoRqbXiPIRfG0TdPnEGSWl/4IurNbM04H7gp2UuwKwv8LW7J75vcrS7nwKcEXZjyph3rJnlmVnetm3bDqd+EakFMjMzKSgoUJiUwd0pKCggMzPzsJcR530k+UC7hOEsYFPCcBPgZGBeuFNxPDDTzIYknP8YRYm9EXf/JPy6y8yeJDiENq3kh7v7FGAKBCfbq2KFROTIk5WVRX5+PvqHsmyZmZlkZWUd9vxxBslSoLOZdQA+IQiFHxRNdPcdQIui4fBqrJ8VhUi4x3IpcGZCm3pAM3f/zMwygMHAqzGug4gc4TIyMujQoUOqy6jVYgsSdz9gZuOBuQSX/z7q7qvN7FYgz91nVrCIM4F8d9+YMK4BMDcMkXSCEPlDDOWLiEiSYn1EirvPAeaUGHdLGW3PLjE8D+hXYtxXQJ8qLVJERCLRne0iIhKJgkRERCJRkIiISCQKEhERiURBIiIikShIREQkEr0hsRy3FBQ9yX5hSusQEanJtEciIiKRKEhERCQSBYmIiESiIBERkUgUJCIiEomCREREIlGQiIhIJAoSERGJREEiIiKRKEhERCQSBYmIiESiIBERkUgUJCIiEomCREREIlGQiIhIJAoSERGJREEiIiKRKEhERCQSBYmIiESiIBERkUgUJFJlRk5exMjJi1JdhohUs3qpLkBqj1sKbgz7Fqa0DhGpXtojEUmR1bfnsvr23FSXccTQ9qqc6txeChIREYlEQSIiIpEoSEREJJJYg8TMBprZOjNbb2Y3ldPuEjNzM8sJh0eb2fKErtDMeobT+pjZynCZE8zM4lwHEREpX2xBYmbpwCTgAqAbcJmZdSulXRPgOmBJ0Th3f8Lde7p7T2AM8KG7Lw8nPwyMBTqH3cC41kFERCoW5x7JacB6d9/o7vuAp4GhpbS7Dbgb2FPGci4DngIws9ZAU3df5O4OTAOGVXnlIiKStDiDpC3wccJwfjiumJn1Atq5+6xyljOSMEjC+fPLW2ZVWnOgLQWFjeNavIhIrRDnDYmlnbvw4olmacD9wJVlLsCsL/C1u69KZpkl5h1LcAiM9u3bJ1dxgv0HC7ln9xD2UY+HNxTQv1PzSi9DRKQuiHOPJB9olzCcBWxKGG4CnAzMM7MPgX7AzKIT7qFR/GtvpGiZWeUss5i7T3H3HHfPadmyZaWLz0hP47cNn6Wx7WH01MU8PG8DwdE0ERFJFGeQLAU6m1kHM6tPEAoziya6+w53b+Hu2e6eDSwGhrh7HhTvsVxKcG6laJ7NwC4z6xderXUF8FJcK9A+vYD7Gz3OBae05q5X3mPsn5exY/f+uD5OROSIFFuQuPsBYDwwF1gLPOvuq83sVjMbksQizgTy3X1jifE/BqYC64ENwMtVWPYhGto+Jl7Wi//6fjdef28rQyYuZPWmHXF+pIjIESXWhza6+xxgTolxt5TR9uwSw/MIDneVbJdHcEis2pgZPxzQgR5ZRzPuiXe4+KF/cNuwkxmR067imUVEajnd2V4Jfb51LLOuyyUn+xh+/vwK/vP5FezZfzDVZYmIpJSCpJJaNG7AtB/1Zfx3TuCZvI8Z/vA/+Kjg61SXJSKSMgqSw5CeZvzs/BN59Moc8rfvZvCDC3h1zaepLktEJCUUJBF896TjmHVtLu2bN+TqaXnc9cp7HDhYmOqyRESqlYIkonbHNuT5a07nstPa8/C8DYz541ts27U31WWJiFQbBUkVyMxI546LT+H3l57K2x9t58IJC1j64eepLktEpFooSKrQJX2yeHHcABrWT2fUlMVMXbBRd8OLSK2nIKliXVs3Zea1uZxzUit+N3st4558m117dDe8iNReCpIYNM3M
YPKYPtw86CTmrv6UoRPfZN2WXakuS0QkFgqSmJgZY8/sxJNX92XX3gMMm/QmM97Jr3hGEZEjTKWDxMzSzKxpHMXURn07Nmf2tbmcknU0//HMu/zqxZXsPaC74UWk9kgqSMzsSTNramaNgDXAOjO7Md7Sao9WTTN58uq+/PuZHZm++CNGPLKI/O26G15Eaodk90i6uftOgtfazgHaE7xLXZJULz2NXwzqyuQxfdi47SsGP7iQeeu2prosEZHIkg2SDDPLIAiSl9x9P2W8mVDKd3734/nrtbkc3zSTH/5pKff97X0OFmpTisiRK9kgmQx8CDQC5pvZt4CdcRVV22W3aMSMnwxgeO8sJrz2AVc+9haff7Uv1WWJiByWpILE3Se4e1t3H+SBfwLfibm2Wu2o+uncc0kP7rz4FJb87+cMnrCAdz7anuqyREQqLdmT7deHJ9vNzP5oZm8D3425tlrPzBh1Wnv+8uPTSUszRkxexLRFH+pueBE5oiR7aOtH4cn284CWwA+BO2Orqo45ue3RzL72DM7s3JJbXlrN9U8v56u9B1JdlohIUpINEgu/DgIec/d3E8ZJFTi6YQZ/uCKHG88/kVkrNjFs0pus3/plqssSEalQskGyzMz+hyBI5ppZE0Av3qhiaWnGuO+cwJ+v6svnX+1j6MSFzFqxKdVliYiUK9kguQq4Cfi2u38N1Cc4vCUxGHBCC2ZfdwYntW7K+Cff4bd/Xc2+A8ptEamZkr1qqxDIAn5lZr8HTnf3FbFWVscdf3QmT4/tx48GdOCxNz9k1JRFbN6xO9VliYgcItmrtu4Erid4PMoa4DozuyPOwgQy0tO45fvdmPiDXqzbsosLJyxk4QefpbosEZFvSPbQ1iDgXHd/1N0fBQYCF8ZXliQa3KMNL43PpXmj+ox5dAkT//4BhbobXkRqiMo8/bdZQv/RVV2IlO+EVo15cdwAhpzaht//z/tcPS2PL77W3fAiknrJBskdwDtm9iczexxYBtweX1lSmkYN6vHAyJ7cNrQ7Cz7YxuAHF7Iyf0eqyxKROi7Zk+1PAf2Av4Rdf3d/Os7CpHRmxpj+2Tz77/0pLHSGP/wPnlzyke6GF5GUKTdIzKx3UQe0BvKBj4E24ThJkV7tj2HWdWfQt+Ox3DxjJT97bgW79+mFWSJS/epVMP3ecqY5et5WSh3bqD5/+uFpTHjtAyb8/QNWb9rBw5f3oUOLRqkuTUTqkHKDxN3r9BN+zQtxq9mvtU9PM/7j3C70at+MG55ZzpAHF3LPpacy8OTjU12aiNQRyd5HcnEp3Tlm1iruAlMp+8D/0nH/B/DGPfDpGqjB5yHOPrEVs67NpWPLRlwzfRm3z1nLgYO6G15E4lfRoa0iVwH9gdfD4bOBxUAXM7vV3f8cQ22p5c7OtKY0LdwJr/8u6I7tCCcNhq7fh7Y5kFaz9layjmnIs9f053ez1jJl/kaWf/wFEy/rRaummakuTURqsWT/EhYCXd19uLsPB7oBe4G+wH/GVVxKmVGQ3pL/zegEP10HF94Hx2TD4ofgj+fCfV1h1n/A+tfgQM25n6NBvXRuG3YyD4zsycr8HQyasJDFGwtSXZaI1GLJBkm2u3+aMLwV6OLunwP7y5rJzAaa2TozW29mN5XT7hIzczPLSRjXw8wWmdlqM1tpZpnh+HnhMpeHXfyH15ocD9++CsbMgBs3wMVToX1fePcZmH4x3HMCvHA1rH4R9taMR78P69WWF8cNoOlR9Rg9dQmT39igS4RFJBbJHtpaYGazgOfC4UsI3t3eCPiitBnMLB2YBJxLcNnwUjOb6e5rSrRrAlwHLEkYVw+YDoxx93fNrDnfDKzR7p6XZO1V66hm0OPSoNu/GzbOg7WzYN0cWPkcpDeATt+FroOhywXQqHlKygQ48fgmzByfy38+v4I7Xn6PZf/czu9HnErTzIyU1SQitU+yQTIOuBjIJXih1ePACx78i1vWlV2nAevdfSOAmT0NDCV46GOi
24C7gZ8ljDsPWBG+QAt3r5nHZjKOghMvCLqDB+DjxUGovDcL3n8ZLA3anx6EykkXQrP21V5i4wb1mPiDXvR+8xjumLOWIQ8u5KHRfejWpmm11yIitVOyd7Y7sBD4O/AqMN8rPk7SluDmxSL54bhiZtYLaOfus0rM2wVwM5trZm+b2c9LTH8sPKz1azMr9U2NZjbWzPLMLG/btm0VlFoF0utBdi5ccCfcsBLGvgFn/BS+LoBXboIHToHJZ8Ibd1f7FWBmxlW5HXh6bD927z/IRQ+9yXN5H1c8o4hIEpK9/HcE8BbBIa0RwBIzu6Si2UoZV/zX08zSgPuBn5bSrh7B3s/o8OtFZnZOOG20u58CnBF2Y0r7cHef4u457p7TsmXLCkqtYmbQpid891cwbjFc+zace2tw2Ov1/4aH+8ODveF/fg0fvwWF1XOZbk72scy+7gz6fOsYbnx+Bb/4ywr27Nfd8CISTbKHtn5J8HbErQBm1pJgz+T5cubJB9olDGcBie+NbQKcDMwLdyqOB2aa2ZBw3jfc/bPw8+YAvYHX3P0TAHffZWZPEhxCm5bkeqRG804w4Pqg27UF3psdHP5a/BD8YwI0Ph5OGhQc/so+E+rVj62UFo0b8Oer+nLf39Yx6fUNrPxkBw+P7kO7YxvG9pkiUrsle9VWWlGIhAqSmHcp0NnMOphZfWAUMLNoorvvcPcW7p7t7tkE96UMCU+izwV6mFnD8MT7WcAaM6tnZi0AzCwDGAysSnIdaoZyrwAbXi1XgKWnGTeefxJTr8jho4KvuXDCAl5b+2nFM4qIlCLZPZJXzGwu8FQ4PBKYU94M7n7AzMYThEI68Ki7rzazW4E8d59Zzrzbzew+gjByYI67zw6vEpsbhkg6wV7RH5Jch5onxVeAfa/bccy69gx+/MQyrno8j3Hf6cT/PfdE0tNKPe0kIlKqpILE3W80s+HAAIJzH1PcfUYS882hROC4+y1ltD27xPB0gkuAE8d9BfRJpuYjToquAGvfvCEv/Ph0fjNzNZNe38A7H33BhMt60aJxgypZvojUfsnukeDuLwAvxFiLFCm6Aiw7FwbeAZvfDQLlvdnBFWCv3AStTw0e13LSYGjVNTjBf5gyM9K5c3gPen/rGH794iounLCAST/oTU72sVW4UiJSW5UbJGa2i4QrrRInEVwVrJsR4lZ0BVjRVWAFG4JQWTsruALs9f+usmeAjchpR/c2TfnJE28zaspifjGoKz8akE0ZV1iLiAAVP0a+SXUVIkmK+Qqw7m2OZub4XH723LvcNmsNb/9zO3dd0oPGDZLeeRWROkZ/HY5kRVeAffsq2P0FfPA3eO+vwRVgeY9Cg6Ohy3nB3soJ34MGjZNa7NFHZTBlTB8mz9/I3a+8x9otO3nk8j50OU7/V4jIoRQktUUVXwFmZlxzVidOzWrGtU+9w9CJb3LHxacwrFfbcucTkbpHQVIbVeEVYP07NWfOdbmMf/IdbnhmOcv+uZ1fDe5Kg3rp1bhCIlKTKUhquyq4AqxV00ye+D99uWfuOqbM38iKT3bw0OjetG12VIpWSkRqkpr1ij+JV+IzwH6yqFLPAMtIT+PmQV155PLebNj6JYMnLOCN96vhYZgiUuMpSOqyoivArv5bBW+BfLX4LZADT27NzPEDOK5pJlc+9hYPvPo+hYV6YZZIXaZDWxKoxBVgHU/4HjN+MoBfvriSB179gLc/+oIHRvZM9RqISIooSORQpV0B9t4sWPdy8RVgR3X6DveedCEDWp/KL17ZxPcfXMhPD7bmxPTNqa5eRKqZgkTKV84VYPb+Kwy3NM5vdxpTP+vGA19dwIkZW2jz0irS0ox6aVb8Nd2M9LQ00tMgPS3tG9O+2ebQLrk2/1p2uhnp6aW3rVfUb8EyRSQ6BYkkr4wrwBq/N5sbDjzKDZmw3Rvj7zyMu1FI0Dngxf3BH+9CD/pLjk8c/mZHqeMPAPtLm8cPnQegkLTiehwLr1Az3NKK+83+
Nc0sLaE/aBc8MiateFxROyyteH6K2hVPM9KKxqcF0/bsDZ46sHPy+Gr9NsYn3mDevTcDgJ1Tro31c2qLou3VZd9eMurH+xBWBYkcnlKeAbbloUHU930c23tY+CphBy8s7vfCQgrdwZ3CwoO4e9gV4oWFCcMHoag/YTxeiIfL+9fwN/uDz0wYdgcP3gIZzHsgobawvlL6rWhc8XC4Hg5W1A7HitqWMpzmiZH1r/HfjDxg04pq/uYduQyHT7S9KuPgwQNkoCCRI0HzThSkB680Pnbw/aU2MYKXyJDwtS5bfXsuAN1vXpjiSo4M2l6VU7y9jmoU+2fp8l8REYlEQSIiIpEoSEREJBIFiYiIRKIgERGRSBQkIiISiYJEREQiUZCIiEgkuiGxHN1bH53qEkREajztkYiISCQKEhERiURBIiIikShIREQkEgWJiIhEoiAREZFIFCQiIhKJgkRERCKJNUjMbKCZrTOz9WZ2UzntLjEzN7OchHE9zGyRma02s5VmlhmO7xMOrzezCRa8QFtERFIktiAxs3RgEnAB0A24zMy6ldKuCXAdsCRhXD1gOnCNu3cHzgb2h5MfBsYCncNuYFzrICIiFYtzj+Q0YL27b3T3fcDTwNBS2t0G3A3sSRh3HrDC3d8FcPcCdz9oZq2Bpu6+yN0dmAYMi3EdRESkAnEGSVvg44Th/HBcMTPrBbRz91kl5u0CuJnNNbO3zeznCcvML2+ZIiJSveJ8aGNp5y68eKJZGnA/cGUp7eoBucC3ga+B18xsGbCzvGV+48PNxhIcAqN9+/aVqVtERCohzj2SfKBdwnAWsClhuAlwMjDPzD4E+gEzwxPu+cAb7v6Zu38NzAF6h+OzyllmMXef4u457p7TsmXLKlolEREpKc4gWQp0NrMOZlYfGAXMLJro7jvcvYW7Z7t7NrAYGOLuecBcoIeZNQxPvJ8FrHH3zcAuM+sXXq11BfBSjOsgIiIViC1I3P0AMJ4gFNYCz7r7ajO71cyGVDDvduA+gjBaDrzt7rPDyT8GpgLrgQ3AyzGtgoiIJCHWF1u5+xyCw1KJ424po+3ZJYanE1wCXLJdHsEhMRERqQF0Z7uIiESiIBERkUgUJCIiEomCREREIlGQiIhIJAoSERGJREEiIiKRKEhERCQSBYmIiESiIBERkUgUJCIiEomCREREIlGQiIhIJAoSERGJREEiIiKRxPo+kiPeD2dX3EZEpI7THomIiESiIBERkUgUJCIiEomCREREIlGQiIhIJAoSERGJREEiIiKRKEhERCQSBYmIiESiIBERkUgUJCIiEomCREREIlGQiIhIJAoSERGJREEiIiKRKEhERCQSBYmIiEQSa5CY2UAzW2dm683spnLaXWJmbmY54XC2me02s+Vh90hC23nhMoumtYpzHUREpHyxvWrXzNKBScC5QD6w1MxmuvuaEu2aANcBS0osYoO79yxj8aPdPa+qaxYRkcqLc4/kNGC9u290933A08DQUtrdBtwN7ImxFhERiUmcQdIW+DhhOD8cV8zMegHt3H1WKfN3MLN3zOwNMzujxLTHwsNavzYzq9qyRUSkMmI7tAWU9gfeiyeapQH3A1eW0m4z0N7dC8ysD/CimXV3950Eh7U+CQ+JvQCMAaYd8uFmY4GxAO3bt4+6LiIiUoY490jygXYJw1nApoThJsDJwDwz+xDoB8w0sxx33+vuBQDuvgzYAHQJhz8Jv+4CniQ4hHYId5/i7jnuntOyZcsqXTEREfmXOINkKdDZzDqYWX1gFDCzaKK773D3Fu6e7e7ZwGJgiLvnmVnL8GQ9ZtYR6AxsNLN6ZtYiHJ8BDAZWxbgOIiJSgdgObbn7ATMbD8wF0oFH3X21md0K5Ln7zHJmPxO41cwOAAeBa9z9czNrBMwNQyQdeBX4Q1zrIJXT/eaFqS5BRFLA3L3iVke4nJwcz8vT1cIiIpVhZsvcPaeidrqzXUREIlGQiIhIJAoSERGJREEiIiKRKEhE
RCQSBYmIiESiIBERkUgUJCIiEomCREREIqkTd7ab2Tbgn4c5ewvgsyosp6qorspRXZWjuiqnttb1LXev8Km3dSJIojCzvGQeEVDdVFflqK7KUV2VU9fr0qEtERGJREEiIiKRKEgqNiXVBZRBdVWO6qoc1VU5dbounSMREZFItEciIiKRKEhKMLNLzWy1mRWaWZlXO5jZQDNbZ2brzeymaqjrWDP7m5l9EH49pox2B81sediV9xbKqPWUu/5m1sDMngmnLzGz7LhqqWRdV5rZtoRtdHU11PSomW01s1JfC22BCWHNK8ysd9w1JVnX2Wa2I2Fb3VJNdbUzs9fNbG34u3h9KW2qfZslWVe1bzMzyzSzt8zs3bCu35bSJt7fR3dXl9ABXYETgXlAThlt0oENQEegPvAu0C3muu4Gbgr7bwLuKqPdl9WwjSpcf+AnwCNh/yjgmRpS15XAxGr+mToT6A2sKmP6IOBlwIB+wJIaUtfZwKzq3Fbh57YGeof9TYD3S/k+Vvs2S7Kuat9m4TZoHPZnAEuAfiXaxPr7qD2SEtx9rbuvq6DZacB6d9/o7vuAp4GhMZc2FHg87H8cGBbz55UnmfVPrPd54BwzsxpQV7Vz9/nA5+U0GQpM88BioJmZta4BdaWEu29297fD/l3AWqBtiWbVvs2SrKvahdvgy3AwI+xKnvyO9fdRQXJ42gIfJwznE/8P1HHuvhmCH2igVRntMs0sz8wWm1lcYZPM+he3cfcDwA6geUz1VKYugOHh4ZDnzaxdzDUlIxU/T8nqHx4yednMulf3h4eHYHoR/JedKKXbrJy6IAXbzMzSzWw5sBX4m7uXub3i+H2sV1ULOpKY2avA8aVM+qW7v5TMIkoZF/nyt/LqqsRi2rv7JjPrCPzdzFa6+4aotZWQzPrHso0qkMxn/hV4yt33mtk1BP+lfTfmuiqSim2VjLcJHpHxpZkNAl4EOlfXh5tZY+AF4AZ331lycimzVMs2q6CulGwzdz8I9DSzZsAMMzvZ3RPPfcW6vepkkLj79yIuIh9I/E82C9gUcZnl1mVmn5pZa3ffHO7Cby1jGZvCrxvNbB7Bf01VHSTJrH9Rm3wzqwccTfyHUSqsy90LEgb/ANwVc03JiOXnKarbn6z0AAADj0lEQVTEP5LuPsfMHjKzFu4e+zOlzCyD4I/1E+7+l1KapGSbVVRXKrdZ+JlfhL/3A4HEIIn191GHtg7PUqCzmXUws/oEJ69iu0IqNBP4t7D/34BD9pzM7BgzaxD2twAGAGtiqCWZ9U+s9xLg7x6e6YtRhXWVOI4+hOA4d6rNBK4Ir0TqB+woOoyZSmZ2fNFxdDM7jeDvRUH5c1XJ5xrwR2Ctu99XRrNq32bJ1JWKbWZmLcM9EczsKOB7wHslmsX7+1idVxccCR1wEUF67wU+BeaG49sAcxLaDSK4amMDwSGxuOtqDrwGfBB+PTYcnwNMDftPB1YSXK20ErgqxnoOWX/gVmBI2J8JPAesB94COlbT96+iuu4AVofb6HXgpGqo6SlgM7A//Nm6CrgGuCacbsCksOaVlHG1YArqGp+wrRYDp1dTXbkEh11WAMvDblCqt1mSdVX7NgN6AO+Eda0Cbinl5z7W30fd2S4iIpHo0JaIiESiIBERkUgUJCIiEomCREREIlGQiIhIJAoSkSpgZl9W3Krc+Z8Pn0aAmTU2s8lmtiF8mut8M+trZvXD/jp5I7HUXAoSkRQLn8eU7u4bw1FTCe467uzu3QmeWNzCgwdRvgaMTEmhImVQkIhUofBO63vMbJWZrTSzkeH4tPBxGavNbJaZzTGzS8LZRhM+qcDMOgF9gV+5eyEEj7tx99lh2xfD9iI1hnaRRarWxUBP4FSgBbDUzOYTPK4mGziF4MnNa4FHw3kGENxlDtAdWO7BQ/hKswr4diyVixwm7ZGIVK1cgqcLH3T3T4E3CP7w5wLPuXuhu28heDxLkdbAtmQWHgbMPjNrUsV1ixw2BYlI
1SrrZUHlvURoN8GzkCB4TtOpZlbe72YDYM9h1CYSCwWJSNWaD4wMXzTUkuB1tm8BCwleqJVmZscRvJK1yFrgBAAP3h2TB/w24Smync1saNjfHNjm7vura4VEKqIgEalaMwiewvou8Hfg5+GhrBcInrC7CphM8Ga9HeE8s/lmsFxN8IKz9Wa2kuC9KUXv2vgOMCfeVRCpHD39V6SamFljD96c15xgL2WAu28J3yHxejhc1kn2omX8BfiFu6+rhpJFkqKrtkSqz6zwBUT1gdvCPRXcfbeZ/RfBe7U/Kmvm8GVdLypEpKbRHomIiESicyQiIhKJgkRERCJRkIiISCQKEhERiURBIiIikShIREQkkv8PudO7Qp5dD4kAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Plot the cross-validation logloss curves (training folds)\n",
    "train_means = grid.cv_results_['mean_train_score']\n",
    "train_stds = grid.cv_results_['std_train_score']\n",
    "\n",
    "# cv_results_ is ordered with C as the outer loop and penalty as the inner loop\n",
    "# (ParameterGrid sorts the parameter names), so rows are C values and columns are penalties\n",
    "n_Cs = len(Cs)\n",
    "number_penaltys = len(penaltys)\n",
    "train_scores = np.array(train_means).reshape(n_Cs, number_penaltys)\n",
    "train_stds = np.array(train_stds).reshape(n_Cs, number_penaltys)\n",
    "\n",
    "x_axis = np.log10(Cs)\n",
    "for i, value in enumerate(penaltys):\n",
    "    # The scores are negated loglosses, so flip the sign before plotting\n",
    "    plt.errorbar(x_axis, -train_scores[:, i], yerr=train_stds[:, i], label=value + ' Train')\n",
    "\n",
    "plt.legend()\n",
    "plt.xlabel('log(C)')\n",
    "plt.ylabel('logloss')\n",
    "plt.savefig('LogisticGridSearchCV_C.png')\n",
    "\n",
    "plt.show()"
   ]
  },
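  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The figure above shows the logloss on the training folds for L1 and L2 regularization at each value of C. The best setting reported by the grid search is C=1 with the L1 penalty; note that this choice comes from the validation-fold scores (`grid.best_score_`), not from the training curves plotted here. The plot does not look quite right to me, and I would appreciate it if the instructor could point out the problem when grading."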
   ]
  },
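  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to sanity-check a plot like the one above is to tabulate `cv_results_` by parameter name instead of relying on `reshape`, and to put the validation scores next to the training scores. The cell below is a minimal sketch of that idea on synthetic data (assumption: `make_classification` stands in for the Pima data; `X_demo`, `y_demo`, `grid_demo`, `val_table` and `train_table` are hypothetical names). If a figure is wanted, something like `val_table.plot(logx=True)` should draw one curve per penalty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "# Synthetic stand-in data (assumption: NOT the Pima dataset)\n",
    "X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)\n",
    "\n",
    "grid_demo = GridSearchCV(\n",
    "    LogisticRegression(solver='liblinear'),\n",
    "    {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10, 100, 1000]},\n",
    "    cv=5, scoring='neg_log_loss', return_train_score=True)\n",
    "grid_demo.fit(X_demo, y_demo)\n",
    "\n",
    "res = pd.DataFrame(grid_demo.cv_results_)\n",
    "res['C'] = res['param_C'].astype(float)\n",
    "res['penalty'] = res['param_penalty'].astype(str)\n",
    "# Flip the sign so that the values are positive loglosses\n",
    "res['val_logloss'] = -res['mean_test_score']\n",
    "res['train_logloss'] = -res['mean_train_score']\n",
    "\n",
    "# Rows are C values, one column per penalty; no assumption about the row order of cv_results_\n",
    "val_table = res.pivot(index='C', columns='penalty', values='val_logloss')\n",
    "train_table = res.pivot(index='C', columns='penalty', values='train_logloss')\n",
    "print(val_table)\n",
    "print(train_table)"
   ]
  },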
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Tuning the regularization hyperparameters with LogisticRegressionCV (5-fold cross-validation, accuracy as the metric)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegressionCV\n",
    "\n",
    "Cs = [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]\n",
    "#nCs = 9  #Cs values are chosen in a logarithmic scale between 1e-4 and 1e4.\n",
    "\n",
    "# For large, high-dimensional data (e.g. 60k+ samples, 93 features) with L1 regularization, the saga solver (added in version 0.19) is an option\n",
    "# This dataset is small (768 samples, 8 features), so liblinear is used here\n",
    "# LogisticRegressionCV is typically faster than GridSearchCV\n",
    "lrcv_L1 = LogisticRegressionCV(Cs=Cs, cv=5, penalty='l1', solver='liblinear', multi_class='ovr', n_jobs=4)\n",
    "lrcv_L1.fit(X_train, y_train)\n",
    "\n",
    "# Predict on the training set with the fitted model (no separate test set is used here)\n",
    "y_train_pred_lr = lrcv_L1.predict(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The AS of LogisticRegression on train is 0.7747395833333334\n"
     ]
    }
   ],
   "source": [
    "# Evaluate the model by accuracy\n",
    "from sklearn.metrics import accuracy_score  # classification accuracy metric\n",
    "AS = accuracy_score(y_train, y_train_pred_lr)\n",
    "print('The AS of LogisticRegression on train is', AS)"
   ]
  },
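  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The accuracy above is measured on the same data that was used for fitting, so it is probably optimistic. The cross-validated accuracy that `LogisticRegressionCV` used to pick C is kept on the fitted object. The cell below is a minimal sketch of reading it from the `C_` and `scores_` attributes, again on synthetic data (assumption: `make_classification` stands in for the Pima data; `X_demo`, `y_demo`, `Cs_demo` and `lrcv_demo` are hypothetical names)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.linear_model import LogisticRegressionCV\n",
    "\n",
    "# Synthetic stand-in data (assumption: NOT the Pima dataset)\n",
    "X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)\n",
    "\n",
    "Cs_demo = [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]\n",
    "lrcv_demo = LogisticRegressionCV(Cs=Cs_demo, cv=5, penalty='l1', solver='liblinear')\n",
    "lrcv_demo.fit(X_demo, y_demo)\n",
    "\n",
    "# scores_ maps a class label to an array of shape (n_folds, n_Cs);\n",
    "# for a binary problem there is a single entry, keyed by the positive class\n",
    "fold_scores = lrcv_demo.scores_[lrcv_demo.classes_[1]]\n",
    "mean_cv_accuracy = fold_scores.mean(axis=0)\n",
    "\n",
    "print('selected C:', lrcv_demo.C_[0])\n",
    "print('mean CV accuracy per C:', dict(zip(Cs_demo, mean_cv_accuracy)))"
   ]
  },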
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.40965794,  1.12851509, -0.0929835 ,  0.02476064, -0.08406547,\n",
       "         0.62912594,  0.27933108,  0.14384827]])"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lrcv_L1.coef_"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
